Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require large dedicated computing infrastructure to perform such analysis, and that infrastructure may not be available to every researcher. In this paper, we propose a novel approach called ScalLoPS that performs searching on protein sequence datasets using LSH (Locality-Sensitive Hashing) implemented on the MapReduce distributed framework. ScalLoPS is designed to scale across computing resources sourced from cloud computing providers. We present the design and implementation of ScalLoPS, followed by an evaluation with datasets derived from both traditional and metagenomic studies. Our experiments show that this method approximates the quality of BLAST results while improving the scalability of protein sequence search.
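
To make the general idea of LSH-based sequence search in a map/reduce style concrete, the following is a minimal, single-process sketch. It is not the ScalLoPS implementation: the abstract does not specify which LSH scheme is used, so MinHash with banding over k-mers is swapped in here as one common choice, and all function names and parameters (`kmers`, `map_phase`, `reduce_phase`, `bands`, `rows`, `k`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers (shingles) of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with an integer seed."""
    digest = hashlib.blake2b(
        token.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, num_hashes, k=3):
    """MinHash signature: the minimum hash of the k-mer set under each seed."""
    shingles = kmers(seq, k)
    if not shingles:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def map_phase(seq_id, seq, bands=16, rows=2, k=3):
    """Map step (assumed design): emit one ((band, bucket), seq_id) pair per band."""
    sig = minhash_signature(seq, bands * rows, k)
    for b in range(bands):
        yield (b, tuple(sig[b * rows:(b + 1) * rows])), seq_id


def reduce_phase(emitted):
    """Reduce step (assumed design): group ids by bucket and pair up co-located ids."""
    buckets = defaultdict(set)
    for key, seq_id in emitted:
        buckets[key].add(seq_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def candidate_pairs(sequences, bands=16, rows=2, k=3):
    """Run both phases in-process over a dict of {seq_id: sequence}."""
    emitted = [
        kv
        for seq_id, seq in sequences.items()
        for kv in map_phase(seq_id, seq, bands, rows, k)
    ]
    return reduce_phase(emitted)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    demo = {
        "query": "MKTAYIAKQRQISFVKSHFSRQ",
        "copy": "MKTAYIAKQRQISFVKSHFSRQ",
        "other": "GGGGGGGGGG",
    }
    print(candidate_pairs(demo))
```

In a real MapReduce deployment the map and reduce functions would run on separate workers and the framework would perform the grouping by key; the candidate pairs would then typically be verified with an exact alignment or similarity score. Raising `rows` makes buckets stricter (fewer false candidates), while raising `bands` increases the chance that similar sequences share at least one bucket.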